Abstract. A new fast (real time) sorter of binary numbers by one-dimensional
cellular automata is proposed. It sorts a list of n numbers represented by k bits
each in exactly nk steps. This is only one step more than a lower bound.



1. Introduction 



Sorting is one of the most fundamental subjects of computer science, and many
sorting algorithms, including sorting arrays and networks, can for example be found
in volume 3 of Knuth's TAOCP [2]. However, for cellular automata there are only
a few papers on this important topic. It should be pointed out that the algorithm
described by one of the authors in an earlier paper [3] has running time 3nk (despite the
title starting with the words "real time").

We are not aware of any speedup technique that would turn this
CA, or any other CA solving the problem, into one running in real time, i.e. in exactly the
number of steps given by the length of the input. In the present paper we propose a
sorting algorithm for binary numbers and its implementation on a one-dimensional CA
with nearest neighbors, which sorts n numbers of k bits each in exactly nk steps.
Sequential comparison-based sorting algorithms need time $\Theta(n \log n)$, where n is
the number of elements to be sorted and it is assumed that each comparison can be
done in constant time independent of the size of the elements. The latter assumption
is also usually made for parallel sorting algorithms. On linear arrays odd-even
transposition sort needs exactly n steps. But there the additional assumption is
made that from the beginning each processor knows the parity of its own address
(assuming those are e.g. 1 to n). Of course, on parallel models from the second machine
class [1] like the PRAM there are algorithms running in poly-logarithmic (or even
logarithmic) time. But for these models one has to assume non-local communication
in constant time when embedded into Euclidean space [4].

The rest of the paper is organized as follows. In Section 2 we precisely state
the problem and the results obtained in this paper. In Section 3 we give a short
proof for the lower bound of the sorting problem. The main aspects of our algorithm
are presented in Section 4. The first version does not achieve a running time which
matches the lower bound. For that, two modifications are needed, which are described
in Section 5.

2. Statement of problem and results 

We are considering one-dimensional CA with von Neumann neighborhood of 
radius 1 and assume that the reader is familiar with these concepts. Since we will 
not define our CA on such a low level there is no need to introduce any related 
formalism. 

We also assume that the reader is familiar with the firing squad synchronization
problem [5]. If a block of k cells needs to be synchronized and there are generals at
both ends, then synchronization can be achieved in exactly k steps.

The inputs for our CA are provided as finite words with all surrounding cells in 
a quiescent state. Those cells will never be used during computations. 

The inputs which have to be processed by our CA are n numbers of equal length 
k, with the most and least significant bits marked as such. 

Problem 2.1. The input alphabet is $A = \{0, 1, \langle 0, \langle 1, 0|, 1|\}$.

Each input w that has to be processed properly is of the form

$$w = w_1 \cdots w_n = \langle x_1 y_1 z_1| \cdots \langle x_n y_n z_n|$$

for some $n \ge 1$ and $k \ge 2$, where all $x_i, z_i \in \{0, 1\}$ and all $y_i \in \{0, 1\}^{k-2}$. Each $w_i$
is the binary representation of a non-negative integer (also denoted $w_i$) with most
significant bit $x_i$ and least significant bit $z_i$.

For every such input, after a finite number of steps a stable configuration of
the form $w_{\sigma(1)} \cdots w_{\sigma(n)}$ has to be reached, where $\sigma$ is a permutation of the numbers
1, ..., n such that $w_{\sigma(i)} \le w_{\sigma(i+1)}$ holds for all $1 \le i < n$.

We note that it would have been sufficient to mark either the most or the least 
significant bits, because the other end of each number can then always be identified 
by looking at neighbor cells. 

The above problem statement also excludes the case of 1-bit inputs. For those
the "traffic rule" 184 can easily be extended to do sorting, taking into account
quiescent neighbors. The resulting CA works as follows: A cell in state 1 (0) becomes
0 (1) if its right (left) neighbor is 0 (1); otherwise it keeps its current state.
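The extended rule for 1-bit inputs can be sketched in Python as follows. This is only a minimal sketch under our own modelling assumptions: the names QUIESCENT, step and sort_bits are hypothetical, the quiescent state is represented by None, and the CA is simply iterated until the configuration no longer changes.

```python
QUIESCENT = None  # assumption: stands for the quiescent state outside the input


def step(cells):
    """One synchronous step of the extended rule 184 on a finite word of bits."""
    padded = [QUIESCENT] + list(cells) + [QUIESCENT]
    new = []
    for i in range(1, len(padded) - 1):
        left, me, right = padded[i - 1], padded[i], padded[i + 1]
        if me == 1 and right == 0:
            new.append(0)   # the 1 moves one cell to the right
        elif me == 0 and left == 1:
            new.append(1)   # the 1 of the left neighbor arrives here
        else:
            new.append(me)  # in particular next to a quiescent cell
    return new


def sort_bits(cells):
    """Iterate the CA until a stable configuration is reached."""
    cells = list(cells)
    while True:
        nxt = step(cells)
        if nxt == cells:
            return cells
        cells = nxt


print(sort_bits([1, 1, 0, 1, 0, 0]))
```

Every step turns each adjacent pair 1 0 into 0 1, so the number of inversions strictly decreases until no such pair is left, and the stable configuration is the sorted word.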

In the following we always call a sequence of cells which initially stores one input
number $w_i$ the block i (or simply a block).

It is clear that it can be necessary to move a number from block 1 to block n.
This immediately gives a lower bound of (n − 1)k steps for the sorting time. We
shall see that one can do slightly better:

Theorem 2.2. Every CA solving Problem 2.1 needs at least time nk − 1.

Until now the fastest sorting algorithms known needed time cnk for some con- 
stant c > 1 with no obvious possibility to speed up the computation to run in nk 
steps. The main contribution of the present paper therefore is the following: 

Theorem 2.3. There is a CA (which does not depend on n or k) with von Neumann
neighborhood of radius 1 solving the sorting Problem 2.1 in exactly nk steps.

3. Lower bound on sorting time

First consider the input $w = w_1 w_2 \cdots w_n$ where $w_1 = \langle 0 1^{k-2} 1|$ and $w_2 = \cdots =
w_n = \langle 1 0^{k-2} 0|$. Clearly this input sequence is already sorted and the rightmost bit
of the output is a 0.

If on the other hand we flip only the leftmost bit of the first block and consider
the input

$$w_1' \cdot w_2 \cdots w_n = \langle 1 1^{k-2} 1| \cdot \langle 1 0^{k-2} 0| \cdots \langle 1 0^{k-2} 0|$$

then the correct sorted output is

$$w_2 \cdots w_n \cdot w_1' = \langle 1 0^{k-2} 0| \cdots \langle 1 0^{k-2} 0| \cdot \langle 1 1^{k-2} 1|$$

That is, by changing only the leftmost bit of the input the rightmost bit of the
output must change. Hence no CA correctly solving the sorting problem can be
faster than the distance between the leftmost and the rightmost bit, which is nk − 1.
This proves Theorem 2.2.
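The two inputs used in this argument can be checked mechanically. The following Python sketch is ours and not part of the construction: it builds both inputs for hypothetical parameters n and k (with n at least 2), sorts them with an ordinary sequential sort as a reference, and compares the rightmost output bits.

```python
def block(msb, middle, lsb, k):
    """Bits of one k-bit number: msb, then k-2 copies of middle, then lsb."""
    return [msb] + [middle] * (k - 2) + [lsb]


def value(bits):
    """Non-negative integer represented by a list of bits (MSB first)."""
    return int("".join(map(str, bits)), 2)


def sorted_output(blocks):
    """Reference output: the blocks in non-decreasing order, concatenated."""
    return [bit for blk in sorted(blocks, key=value) for bit in blk]


def lower_bound_instance(n, k):
    """The input w and the input w' whose leftmost bit is flipped."""
    w = [block(0, 1, 1, k)] + [block(1, 0, 0, k) for _ in range(n - 1)]
    w_flipped = [block(1, 1, 1, k)] + w[1:]
    return w, w_flipped


n, k = 4, 5
w, w_flipped = lower_bound_instance(n, k)
out, out_flipped = sorted_output(w), sorted_output(w_flipped)
print(out[-1], out_flipped[-1])  # rightmost output bit before and after the flip
print(n * k - 1)                 # distance between leftmost and rightmost cell
```

The two inputs differ only in cell 1, while the outputs differ in cell nk, and information can travel at most one cell per step in a CA with radius 1.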



4. The base sorting algorithm 

The goal of this section is to prove a weakened version of Theorem 2.3.

Lemma 4.1. There is a CA (which does not depend on n or k) with the von Neumann
neighborhood of radius 1 solving the sorting Problem 2.1 in exactly k + nk
steps.

Before going into details and explaining some aspects of the CA on the cell level, 
we describe the main idea on the level of numbers. In particular we will employ a 
well-known simple algorithm for parallel sorting. 



4.1. Odd-even transposition sort 



Throughout this section, one can assume that N is an even number; this is the 
case needed for the CA below. 

Assume that numbers $a_1, a_2, \ldots, a_N$ are given, arranged in an array of processors.
In addition each processor (or each number) has a direction, pointing either to the
left or to the right neighbor. In each step each pair of adjacent processors whose arrows
point to each other exchange their numbers. Both processors compare the two numbers; the
left one keeps the smaller number, the right one the larger number, and both change
their direction to the other neighbor. Thus one step replaces the number $a_i$ of
processor i by

$$\begin{cases} \min(a_i, a_{i+1}) & \text{if } a_i \text{ points to the right} \\ \max(a_{i-1}, a_i) & \text{if } a_i \text{ points to the left} \end{cases}$$

A missing neighboring number to the left is treated as if it were $-\infty$ and a missing
neighboring number to the right is treated as if it were $\infty$.

It is known that odd-even transposition sort always produces the correct result
after exactly N steps (see e.g. [2]).
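The step just described can be written down directly. The following Python sketch is an illustration only; the function names and the encoding of directions as the strings 'L' and 'R' are our assumptions. Directions alternate along the array, so a number pointing to the right always faces a number pointing to the left.

```python
import math


def flip(direction):
    return 'L' if direction == 'R' else 'R'


def oets_step(nums, dirs):
    """One step: facing pairs compare and exchange, then all arrows flip."""
    new = []
    for i, (a, d) in enumerate(zip(nums, dirs)):
        if d == 'R':
            # partner is the right neighbor; a missing one counts as +infinity
            right = nums[i + 1] if i + 1 < len(nums) else math.inf
            new.append(min(a, right))
        else:
            # partner is the left neighbor; a missing one counts as -infinity
            left = nums[i - 1] if i > 0 else -math.inf
            new.append(max(left, a))
    return new, [flip(d) for d in dirs]


def odd_even_transposition_sort(nums, first_dir='L'):
    """Run N steps on N numbers with alternating initial directions."""
    dirs = [first_dir if i % 2 == 0 else flip(first_dir)
            for i in range(len(nums))]
    nums = list(nums)
    for _ in range(len(nums)):
        nums, dirs = oets_step(nums, dirs)
    return nums


print(odd_even_transposition_sort([5, 2, 7, 1, 3, 3]))
```

Starting with the leftmost number pointing to the left corresponds to the arrangement used for the CA below, where the first step only reflects that number at the border.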

4.2. Outline of the base algorithm



The algorithm which will be the basis for the improved construction in Section 5
simply works as follows. Given an input of the form

$$w_1 \; w_2 \; \cdots \; w_n$$

first each number is copied and the two copies get opposite directions assigned. This
will take k steps in the CA. Instead of storing two copies side by side they are stored
in parallel:

$$\begin{array}{cccc} \overleftarrow{w_1} & \overleftarrow{w_2} & \cdots & \overleftarrow{w_n} \\ \overrightarrow{w_1} & \overrightarrow{w_2} & \cdots & \overrightarrow{w_n} \end{array}$$

These are now treated as N = 2n numbers that are sorted using odd-even transposition
sort. In the end one would get

$$\begin{array}{cccc} \overleftarrow{w_{\sigma(1)}} & \overleftarrow{w_{\sigma(2)}} & \cdots & \overleftarrow{w_{\sigma(n)}} \\ \overrightarrow{w_{\sigma(1)}} & \overrightarrow{w_{\sigma(2)}} & \cdots & \overrightarrow{w_{\sigma(n)}} \end{array}$$

where $\sigma$ again denotes the "sorting permutation" as in Problem 2.1.

During the last sorting step the lower parts and the arrows are deleted, and the 
required output is obtained. 

It will become clear in the next subsection why it is actually useful to first copy
each number and then seemingly spend twice as much time for the N = 2n sorting
steps. It will be shown that each such step can be implemented in the CA in k/2
steps. Hence the total running time of the CA for this base version of the algorithm
will be k + 2n · k/2 = k + nk steps.
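On the level of numbers the outline of this subsection can be sketched as follows. This Python sketch is ours and ignores all bit-level details; base_sort and running_time are hypothetical names. The two copies of each number are listed side by side, the left-moving copy first, which is the order in which odd-even transposition sort sees them.

```python
def base_sort(ws):
    """Duplicate each number, run N = 2n odd-even steps, keep the upper copies."""
    nums = [w for w in ws for _ in range(2)]  # <-w_1, ->w_1, <-w_2, ->w_2, ...
    dirs = ['L', 'R'] * len(ws)
    for _ in range(len(nums)):
        new = list(nums)
        for i, d in enumerate(dirs):
            if d == 'R' and i + 1 < len(nums):
                new[i] = min(nums[i], nums[i + 1])
            elif d == 'L' and i > 0:
                new[i] = max(nums[i - 1], nums[i])
            # at the two borders the number is simply reflected
        nums = new
        dirs = ['L' if d == 'R' else 'R' for d in dirs]
    upper, lower = nums[0::2], nums[1::2]
    assert upper == lower  # each block holds the same number twice
    return upper           # the lower parts and the arrows are deleted


def running_time(n, k):
    """k setup steps plus 2n sorting steps of k/2 CA steps each (k even)."""
    return k + 2 * n * (k // 2)


print(base_sort([5, 3, 9, 3]), running_time(4, 6))
```

After an even number of steps the directions are back to their initial pattern, so the upper copies are again the ones pointing to the left.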



4.3. Outline of the CA for the base algorithm 

In this section we will describe how the base sorting algorithm can be implemented
on a CA. It will need n + 1 phases, each of which needs exactly k steps. First
comes a setup phase followed by phases 1, ..., n. In order to avoid more complicated
descriptions, throughout the rest of the paper we assume that k is even.

If k is odd, the middle cell of a block plays the role of the two middle cells one
would have for the (even) case k + 1. In that case synchronization of a block using
generals at both ends needs at least time $2\lceil k/2 \rceil - 2 = k - 1$ and hence is possible in
time k (as is the case for even k).

Algorithm 4.2 (Setup phase). 

(1) During the setup phase the following tasks are carried out in each block: 

• By sending a signal from the left and the right end of the block the two 
middle cells are found. In each resulting sub-block the leftmost and the 
rightmost cell are marked as such. They are called L and R respectively. 

• Using an additional register the mirror image of the input number is 
computed. Below we call the register holding the original value left and 
the registers with the mirrored value right. 

The numbers in the left registers will play the role of the $\overleftarrow{w_i}$ and the
numbers in the right registers the role of the $\overrightarrow{w_i}$ used in the previous
subsection.

• Using synchronization, the setup phase is stopped after k steps.

(2) The leftmost and the rightmost cell of the whole input are set up as generals.
Starting with the first step of phase 1 an algorithm is started to synchronize
all nk cells after nk steps.

A concrete example is shown in Figure 1 for two 6-bit numbers. The borders between
sub-blocks are shown as double vertical lines. It can be noted that we also use
markers for the most and least significant bits of the mirrored numbers.

Figure 1: An example configuration for two 6-bit numbers after the setup phase.

In addition to the registers left and right, the L- and R-cells of each sub-block
will make use of a register comp which will hold a (preliminary) comparison result
(see also Figure 2). Each comp register can hold one of the values =, < or >. Their
use is described in Algorithm 4.4 below.

The core idea is the following:
C1. Bits are shifted in the left and right registers in the corresponding directions.
C2. Whenever the most significant bits of two numbers arrive in a pair of adjacent
R‖L-cells, the numbers are compared sequentially bit by bit. The smaller number
will be directed to move to the left and the larger one will be directed to
move to the right.

Part C1 basically means that numbers are unconditionally shifted everywhere except
at R‖L pairs. Since those are located at distance k/2 and numbers have length k,
this might look suspicious at first sight, because in general a number simultaneously
gets compared at two such pairs. It will become clear later why this does not pose
any problems. Ignoring it for the moment, C1 is easy to implement:

Algorithm 4.3 (Implementation of C1).

(1) The cells which are not an R- or L-cell have a very simple behavior. 

• The left register gets its content from the left register of the right neighbor.

• The right register gets its content from the right register of the left
neighbor.

(2) Analogously the left register of an L-cell gets its content from the left register
of the right neighbor and the right register of an R-cell gets its content from
the right register of the left neighbor.

(3) The same holds for the left register of an R-cell and the right register of an
L-cell if the comp register of that cell has value =.

First of all, this part is needed during the first sub-phase of phase 1 when
the R‖L pairs in the middle of a block still have no meaning. As will be seen,
this requirement is also consistent with the rules for later sub-phases.

Each of the phases 1, . . . , n is subdivided into two sub-phases of k/2 steps each. 

We will now describe how the comparison of numbers is done. For this we use
R.left to denote the left register of the R-cell and similarly for the other cases. It
will be seen that all information needed to update R.comp is also available in
the neighboring L-cell, so that the invariant R.comp = L.comp can be maintained.
Hence it suffices to describe the case of R.comp:

Algorithm 4.4 (Implementation of C2). If the two bits in R.right and L.left are
most significant ones, then the new value of R.comp is determined as follows:

$$\text{new R.comp} = \begin{cases} < & \text{if R.right} < \text{L.left} \\ = & \text{if R.right} = \text{L.left} \\ > & \text{if R.right} > \text{L.left} \end{cases} \tag{4.1}$$

If the two bits to be compared are not most significant ones, the new value of R.comp
is determined as follows:

• If R.comp already has value < or > it is not changed.

• If R.comp has old value = its new value is determined according to rule (4.1)
above.

It remains to define how the new value for R.left is computed. That is most easily
described as depending on the just defined new value of R.comp:

$$\text{new R.left} = \begin{cases} \text{R.right} & \text{if R.comp is now} < \\ \text{R.right} & \text{if R.comp is now} = \\ \text{L.left} & \text{if R.comp is now} > \end{cases}$$

Dually the new value for L.right depends on the new value of L.comp (remember
that always R.comp = L.comp):

$$\text{new L.right} = \begin{cases} \text{L.left} & \text{if L.comp is now} < \\ \text{L.left} & \text{if L.comp is now} = \\ \text{R.right} & \text{if L.comp is now} > \end{cases}$$

Figure 2 shows the relevant parts of computations for the comparison of two numbers.
In the left part of the figure initially the larger number is on the left, in the
right part the smaller number is on the left. After the comparison in both cases the
smaller number is on the left. We remind the reader that at the left resp. right end
of the complete input a missing number is treated as $-\infty$ resp. $\infty$. Hence a number
arriving at a border is simply reflected.

The following picture may be helpful: When the smaller number comes from
the left and the larger from the right, then from the first (most significant) bit on both
numbers are reflected at the border between the R- and the L-cell. When the larger
number comes from the left and the smaller from the right, then from the first (most
significant) bit on both numbers pass through the border. This is a correct picture even
if the numbers have identical higher order bits, and hence in the beginning no cell
knows which is the present case. Readers are encouraged to check Figure 2 again.
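The behaviour of a single R‖L pair can be illustrated by the following Python sketch. It is a simplified model under our own assumptions: only one pair is considered, the two numbers are given as bit lists with the most significant bit first, and the names to_bits and compare_exchange_stream are hypothetical.

```python
def to_bits(x, k):
    """k-bit binary representation of x, most significant bit first."""
    return [(x >> (k - 1 - i)) & 1 for i in range(k)]


def compare_exchange_stream(from_left, from_right):
    """Bit-serial compare-exchange at one R||L pair.

    from_left  : bits arriving in R.right, most significant bit first
    from_right : bits arriving in L.left,  most significant bit first
    Returns (bits leaving to the left via R.left,
             bits leaving to the right via L.right).
    """
    comp = '='
    to_left, to_right = [], []
    for t, (r_right, l_left) in enumerate(zip(from_left, from_right)):
        if t == 0 or comp == '=':
            # rule (4.1); a decided comparison (< or >) is never changed
            comp = '<' if r_right < l_left else '>' if r_right > l_left else '='
        if comp == '>':
            # larger number came from the left: both numbers pass through
            to_left.append(l_left)
            to_right.append(r_right)
        else:
            # '<' or '=': both numbers are reflected at the border
            to_left.append(r_right)
            to_right.append(l_left)
    return to_left, to_right


print(compare_exchange_stream(to_bits(11, 4), to_bits(9, 4)))
```

As long as comp has value = the two current bits are equal, so reflecting and passing through produce the same bits; this is why the decision can be postponed until the first differing position.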

At least from a formal point of view it is now straightforward to put the pieces
together. An even better intuition of how the algorithm works may arise in the
subsequent Subsection 4.4, where the correctness of the algorithm will be shown.

Figure 2: Two comparisons of 4 bits. On the left hand side the number coming from
the left (in the right registers) is larger, on the right hand side the number
coming from the right (in the left registers). If the complete numbers were
longer, the comparisons would continue analogously.



Algorithm 4.5. 

• First the setup phase is done as described in Algorithm 4.2. This phase takes
k steps. It is stopped at the correct time using a synchronization algorithm
in each block separately. Both ends of each block have to act as generals in
order to achieve the required synchronization time.

• Once all cells are synchronized they will work as described in Algorithms 4.3
and 4.4. All numbers are shifted to the left or right, and whenever two
most significant bits meet, the sequential comparison of the two numbers is
started. The smaller number is sent to the left and the larger to the right.

• This is repeated until the synchronization started immediately after the setup 
phase fires all nk cells after nk steps. 

It will be shown that at that point in time the left registers contain the 
sorted numbers. 



4.4. Correctness of the base sorting algorithm 

The correctness of Algorithm 4.5 is essentially due to the correctness of odd-even
transposition sort. This is basically a proof by induction. The main parts are stated
in the following lemma.

Lemma 4.6. Let $\mathcal{X} = |xX\rangle$ and $\mathcal{Y} = \langle Yy|$ be two numbers with the k/2 higher order
bits denoted by capital letters and the k/2 lower order bits denoted by small letters.
Similarly let $\mathcal{A} = \langle Aa|$ be the minimum of $\mathcal{X}$ and $\mathcal{Y}$ and $\mathcal{B} = |bB\rangle$ be the maximum.
In other words, one basic compare-and-exchange step of odd-even transposition sort
transforms the pair $\mathcal{X}, \mathcal{Y}$ into the pair $\mathcal{A}, \mathcal{B}$.

If the most significant bits of $X\rangle$ and $\langle Y$ meet in an R‖L-pair (with the other
higher order bits following) and if after k/2 steps the lower order bits $|x$ and $y|$
arrive in the correct order, then during the first k/2 steps the higher order bits of $\langle A$
and $B\rangle$ will be produced moving in the correct directions, followed by the lower order
bits $a|$ and $|b$ afterwards.



This is basically a restatement of the construction from Algorithm 4.4.
Since the configuration produced by the setup phase corresponds to the initial
configuration for the odd-even transposition sort, and since the preconditions of
the if-statement in Lemma 4.6 are met, induction shows that in particular the
higher order bits of each number after t phases are in the sub-block corresponding
to its position in odd-even transposition sort after t sorting steps.



Corollary 4.7. For all input sequences $w_1, \ldots, w_n$ Algorithm 4.5 does sort the numbers
as required in Problem 2.1.

Proof. Since odd-even transposition sort does sort N = 2n numbers in N sorting steps,
it immediately follows from Lemma 4.6 that at the end of phase n of Algorithm 4.5
the higher order bits of each of the n input numbers and of its n copies are in the
correct blocks. That is, there are the same higher order bits of a number $w_i$ in each
block twice, once in the left registers and once in the right registers.

Furthermore it is clear that the most significant bit of the number stored in the
left (resp. right) registers is in the L-cell (resp. R-cell) of the block (not sub-block).
This implies that the lower order bits are in the same block:



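$$\begin{array}{l|c|c|}
 & \text{L} \quad \cdots & \cdots \quad \text{R} \\ \hline
\text{left} & \langle\; k/2 \text{ higher order bits of } w_i & k/2 \text{ lower order bits of } w_i \;| \\ \hline
\text{right} & |\; k/2 \text{ lower order bits of } w_i & k/2 \text{ higher order bits of } w_i \;\rangle \\ \hline
\end{array}$$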



Therefore the left registers of the full block hold the correct value. 



5. A sorting algorithm matching the lower bound 

We will save k steps of the running time of Algorithm 4.5 by starting with
some comparisons not after k but already after k/2 steps and stopping the odd-even
transposition sort k/2 steps earlier. A detailed description will be given in
Section 5.1. The resulting algorithm still computes the correct output except for the
rightmost block. This will be fixed in Section 5.2.

5.1. Speeding up the algorithm 

We describe the fast algorithm as three changes to Algorithm 4.5.

The first change is simple: Since we want to have the result after nk steps instead
of k + nk, the synchronization of all nk cells is not started after the setup phase,
but in the very first step.

The second change concerns the computation of the mirror of a bit string as
required by Algorithm 4.2. It can be implemented by shifting the original to the
left, and letting the L-cell of the block act as reflector sending bits back to the
right in the right registers. This means that after k/2 steps the lower order bits of an
input number have arrived in the left registers of the left sub-block and the higher
order bits are in its right registers. It is useful if the shift to the left is not done using
a temporary register but left. In Figure 3 the resulting process is shown for two
adjacent 6-bit numbers. It can be seen that due to the simultaneous shift to the
left, already after k/2 steps most significant bits meet for the first time. (This also
determines the border of the sub-blocks.) Thus comparisons can be started k/2 steps
earlier. The rightmost block now needs some special attention. Analogously to the
other blocks we assume that symbols representing the "number ∞" are shifted to
the left from the rightmost cell. This is also depicted in Figure 3.

This completes the second change to Algorithm 4.5.

Figure 3: Computing the mirror of two 6-bit numbers. Already after 3 steps most
significant bits meet at the border between sub-blocks.
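The first k/2 steps of this mirroring process can be traced with a small Python sketch. It rests on our own simplifications: no comparisons are modelled (so it is only meaningful for the first k/2 steps), the register right2 is ignored, None stands both for an empty register and for the symbols representing ∞, and shift_and_reflect is a hypothetical name.

```python
def shift_and_reflect(blocks, steps):
    """Shift all input bits to the left in `left`; every L-cell reflects
    the bit it currently holds into its `right` register.

    blocks: list of equal-length bit lists, most significant bit first.
    Returns the contents of all left and right registers after `steps` steps.
    """
    k = len(blocks[0])
    left = [bit for blk in blocks for bit in blk]
    right = [None] * len(left)
    for _ in range(steps):
        new_left = left[1:] + [None]            # None enters from the right end
        new_right = [None] + right[:-1]         # right registers shift to the right
        for start in range(0, len(left), k):    # the L-cell of each block reflects
            new_right[start] = left[start]
        left, right = new_left, new_right
    return left, right


blocks = [[1, 0, 1, 1, 0, 0], [0, 1, 1, 0, 1, 0]]
k = 6
left, right = shift_and_reflect(blocks, k // 2)
# Most significant bit of the first number (in R.right) and of the
# second number (in L.left) at the border between the two sub-blocks:
print(right[k // 2 - 1], left[k // 2])
```

In this trace the left registers of the left sub-block of block 1 hold the lower order bits of the first number and its right registers hold the higher order bits in mirrored order, as stated in the text.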

The third change is the most complicated. For the shifts to the right two registers
are used, right and right2. The additional register right2 is empty almost
everywhere. After each sub-phase there are only two adjacent sub-blocks in which
right2 stores a number:

(1) The initialization takes place in the leftmost block: During the first k/2 steps
the reflected bits are shifted to the right in both right and right2. This is done
only during the first k/2 steps, but not afterwards.

(2) Then it happens for the first time that three most significant bits meet, two
coming from the left and one coming from the right.

In such a case the comparisons are done as follows:

• The largest of the three numbers is shifted to the right in right2.

• The other two numbers are compared and shifted as described in Algorithm 4.4.

A consequence of these rules is that after k steps the right2 registers are used in
the sub-blocks of block 1, after 2k steps in the sub-blocks of block 2, etc., and after
nk steps in the sub-blocks of block n, and nowhere else.

Why do these three changes lead to a result after nk steps where in all blocks
except the rightmost one the left registers contain the correct numbers? That the 
rightmost block can still be wrong can be seen in examples. 

First of all, reflecting $w_1$ twice makes sure that there are really 2n numbers which
are sorted, each $w_i$ twice. Therefore in the end each block will again contain twice
the same number. This would in general not be the case if $w_1$ were reflected
only once, and the argument below would fail.

Since we are still using odd-even transposition sort, it is clear that after k/2 + nk
steps the correct results are obtained everywhere, but with the most significant bits
in the middle of the blocks. Thus the result would look like

How can such a left sub-block arise? Without loss of generality assume that
the input numbers are pairwise different. Then in the left neighboring full block
a smaller number (or $-\infty$) is present. Hence the lower left half must have been
reflected and going back k/2 steps, the bits must have been in the upper part.
Analogously the upper right half must have been in the lower part. Except in the
rightmost block (where the right2 registers are in use) there is no other possibility
than that the lower order bits are in the remaining registers:

Thus the left registers hold the desired result.

In the rightmost block it can happen that lower order bits are not stored in the 
left registers of the right sub-block but in the right2 registers of the left sub-block. 



5.2. Determining the rightmost output block 

In order to produce the largest number in the rightmost output block we use a
separate algorithm which has to be run in parallel to the one described above. It
will have finished after nk steps. Remember that the input is a word

$$w = w_1 \cdots w_n = \langle x_1 y_1 z_1| \cdots \langle x_n y_n z_n|$$

and the task is to have the maximum of the $w_i$ stored in the rightmost block in
the end. This can be achieved as follows.

Algorithm 5.1. 

(1) During the k steps of the setup phase a signal is sent from the right end of 
the input until it reaches the most significant bit of Wn, marking all cells as 
belonging to the last block. 

When the last block is synchronized after k steps, all cells have received 
the information and know that they have an additional task. 

(2) From the very first step all cells shift their input to the right using an additional
register.

Hence after 1 · k steps the number $w_{n-1}$ reaches the last block, after 2 · k
steps number $w_{n-2}$ reaches the last block, etc., and after (n − 1)k steps number
$w_1$ reaches the last block.

(3) The cells in the rightmost block use two additional registers for storing numbers;
call them max and next. Register max is initialized in the very first
step with $w_n$, register next is marked as not holding a value. Whenever the
rightmost block ends a phase, the number that has arrived from the left in right
because of the shifting is stored in register next.

(4) If at the beginning of a phase register next has a valid number, the rightmost
block computes max ← max(max, next). For this a signal is sent from left to
right, that is from the most significant bit to the least significant bit, comparing
next and max. (While this comparison takes place the next number
is already arriving in right.)

As long as the same bit value is found in both registers nothing is changed
and the signal moves one cell to the right.

As soon as at some position for the first time different bit values are found,
the following happens:

• If next has a 1 bit, but max has a 0 bit, max is smaller than next and
this and all remaining bits are copied from next to max.

• If next has a 0 bit, but max has a 1 bit, max is larger than next and
the signal is simply killed leaving max unchanged.

It is straightforward to verify by induction that for $1 \le i \le n$ after phase i one has

$$\mathit{max} = \max\{w_{n-j} \mid 0 \le j < i\}$$

and hence in the end $\mathit{max} = \max\{w_j \mid 1 \le j \le n\}$ as required. Since $w_1$ is copied to
next after (n − 1)k steps, the final correct value is stored in max k steps later, i.e.
after nk steps as required.
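Step (4) of Algorithm 5.1 and the induction above can be illustrated by the following Python sketch. It abstracts from all timing: the sweep of the signal is modelled by a loop over bit positions, and update_max, rightmost_block and to_bits are hypothetical names.

```python
def to_bits(x, k):
    """k-bit binary representation of x, most significant bit first."""
    return [(x >> (k - 1 - i)) & 1 for i in range(k)]


def update_max(max_bits, next_bits):
    """One sweep of the signal from the most to the least significant bit."""
    result = list(max_bits)
    for pos, (m, x) in enumerate(zip(max_bits, next_bits)):
        if m == x:
            continue                        # equal bits: the signal moves on
        if x == 1:                          # next is larger than max:
            result[pos:] = next_bits[pos:]  # copy this and all remaining bits
        break                               # otherwise the signal is killed
    return result


def rightmost_block(ws_bits):
    """Content of max after all phases; ws_bits lists w_1, ..., w_n."""
    mx = list(ws_bits[-1])                  # max is initialized with w_n
    for nxt in reversed(ws_bits[:-1]):      # arrival order: w_{n-1}, ..., w_1
        mx = update_max(mx, nxt)
    return mx


numbers = [9, 13, 4, 11]
print(rightmost_block([to_bits(w, 4) for w in numbers]))
```

Since the bits before the first differing position agree in both registers, copying only the remaining bits is enough to replace max by next.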

Taking together the changes to the base Algorithm 4.5 described in Section 5.1
and the additional algorithm just described, one gets a proof of Theorem 2.3.

6. Conclusion 

We have shown that the sorting of n numbers with k bits can be achieved in (almost)
real time. Thus the situation is very similar to the firing squad synchronization
problem: There is an algorithm which has (in our case except for one step) a
running time matching a lower bound.

Clearly, the number of states per cell required by our algorithm is finite but 
large, at least when compared to algorithms e. g. for the synchronization problem. 
We do not know how much the set of states can be reduced. 

The authors gratefully acknowledge a number of suggestions by the referees for 
improving the presentation of the algorithms. 